Abstract

We sharply characterize the performance of different penalization schemes for the problem
of selecting the relevant variables in the multi-task setting. Previous work focuses on 
the regression problem where conditions on the design matrix complicate the analysis. A 
clearer and simpler picture emerges by studying the Normal means model. This model, 
often used in the field of statistics, is a simplified model that provides a laboratory for 
studying complex procedures. 

Keywords: high-dimensional inference, multi-task learning, sparsity, Normal means, 
minimax estimation 
1. Introduction

We consider the problem of estimating a sparse signal in the presence of noise. It has been
empirically observed, on various data sets ranging from cognitive neuroscience (Liu et al., 2009)
to genome-wide association mapping studies (Kim et al., 2009), that considering related
estimation tasks jointly improves estimation performance. Because of this, joint estimation
from related tasks, or multi-task learning, has received much attention in the machine
learning and statistics community (see, for example, Zhang, 2006; Negahban and Wainwright,
2009; Obozinski et al., 2010; Lounici et al., 2009; Liu et al., 2009; Lounici et al., 2010;
Argyriou et al., 2008; Kim et al., 2009, and references therein). However, the theory behind
multi-task learning is not yet settled.

An example of a multi-task learning problem is the problem of estimating the coefficients
of several multiple regression problems

$$y_j = X_j \beta_j + \epsilon_j, \quad j \in [k], \tag{1}$$

where $X_j \in \mathbb{R}^{n \times p}$ is the design matrix, $y_j \in \mathbb{R}^n$ is the vector of observations,
$\epsilon_j \in \mathbb{R}^n$ is the noise vector, $\beta_j \in \mathbb{R}^p$ is the unknown vector of regression
coefficients for the $j$-th task and $[n] = \{1, \ldots, n\}$.

When the number of variables $p$ is much larger than the sample size $n$, it is commonly
assumed that the regression coefficients are jointly sparse, that is, there exists a small subset
$S \subset [p]$, with $s := |S| \ll n$, of the regression coefficients that are non-zero for all or most of
the tasks.

The model in (1) under the joint sparsity assumption was analyzed in, for example,
Obozinski et al. (2010), Lounici et al. (2009), Negahban and Wainwright (2009), Lounici et al.
(2010) and Kolar and Xing (2010). Obozinski et al. (2010) propose to minimize the penalized
least squares objective with the mixed $(2,1)$-norm of the coefficients as the penalty
term. The authors focus on consistent estimation of the support set $S$, albeit under the
assumption that the number of tasks $k$ is fixed. Negahban and Wainwright (2009) use the
mixed $(\infty,1)$-norm of the coefficients as the penalty term instead and focus on the exact
recovery of the non-zero pattern of the regression coefficients, rather than the support set $S$.
For a rather limited case of $k = 2$, the authors show that when the regressions do not share
a common support, it may be harmful to consider the regression problems jointly using the
mixed $(\infty,1)$-norm penalty. Kolar and Xing (2010) address the feature selection properties
of simultaneous greedy forward selection; however, it is not clear what the benefits are
compared to ordinary forward selection done on each task separately. In Lounici et al.
(2009) and Lounici et al. (2010), the focus is shifted from consistent selection to the benefits
of joint estimation for prediction accuracy and consistent estimation. The number
of tasks $k$ is allowed to increase with the sample size; however, it is assumed that all tasks
share the same features, that is, a relevant coefficient is non-zero for all tasks.

Despite these previous investigations, the theory is far from settled. A simple clear 
picture of when sharing between tasks actually improves performance has not emerged. In 
particular, to the best of our knowledge, there has been no previous work that sharply 
characterizes the performance of different penalization schemes on the problem of selecting 
the relevant variables in the multi-task setting. 

In this paper we study multi-task learning in the context of the many Normal means 
model. This is a simplified model that is often useful for studying the theoretical properties 
of procedures. The use of the many Normal means model is fairly common in statistics but 
appears to be less common in machine learning. 



1.1 The Normal Means Model 

The simplest Normal means model has the form

$$Y_i = \mu_i + \sigma \epsilon_i, \quad i = 1, \ldots, p, \tag{2}$$

where $\mu_1, \ldots, \mu_p$ are unknown parameters and $\epsilon_1, \ldots, \epsilon_p$ are independent, identically
distributed Normal random variables with mean 0 and variance 1. There are a variety of
results (Brown and Low (1996), Nussbaum (1996)) that show that many learning problems
can be converted into a Normal means problem. This implies that results obtained in the
Normal means setting can be transferred to many other settings. As a simple example,
consider the nonparametric regression model $Z_i = m(i/n) + \delta_i$ where $m$ is a smooth
function on $[0,1]$ and $\delta_i \sim \mathcal{N}(0,1)$. Let $\phi_1, \phi_2, \ldots$ be an orthonormal basis on $[0,1]$ and write
$m(x) = \sum_{j=1}^{\infty} \mu_j \phi_j(x)$ where $\mu_j = \int_0^1 m(x)\phi_j(x)\,dx$. To estimate the regression function $m$
we need only estimate $\mu_1, \mu_2, \ldots$. Let $Y_j = n^{-1} \sum_{i=1}^{n} Z_i \phi_j(i/n)$. Then $Y_j \approx \mathcal{N}(\mu_j, \sigma^2)$
where $\sigma^2 = 1/n$. This has the form of (2) with $\sigma = 1/\sqrt{n}$. Hence this regression problem
can be converted into a Normal means model.
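The reduction from nonparametric regression to Normal means can be checked numerically. The following is a minimal sketch, not part of the original analysis: it assumes a cosine basis and a hypothetical regression function with coefficients $\mu_1 = 1$, $\mu_2 = 0.5$ and zero otherwise, and computes the empirical coefficients $Y_j = n^{-1}\sum_i Z_i \phi_j(i/n)$. All function names are illustrative.

```python
import math
import random


def cosine_basis(j, x):
    """Orthonormal cosine basis on [0, 1]: phi_1 = 1, phi_j = sqrt(2) cos((j - 1) pi x)."""
    if j == 1:
        return 1.0
    return math.sqrt(2.0) * math.cos((j - 1) * math.pi * x)


def empirical_coefficients(z, num_coeffs):
    """Return [Y_1, ..., Y_J] with Y_j = n^{-1} sum_i Z_i phi_j(i / n)."""
    n = len(z)
    return [
        sum(z[i - 1] * cosine_basis(j, i / n) for i in range(1, n + 1)) / n
        for j in range(1, num_coeffs + 1)
    ]


def regression_function(x):
    """Hypothetical m(x) with mu_1 = 1, mu_2 = 0.5 and all other mu_j = 0."""
    return 1.0 * cosine_basis(1, x) + 0.5 * cosine_basis(2, x)


def simulate_regression(n, rng):
    """Draw Z_i = m(i / n) + delta_i with delta_i ~ N(0, 1)."""
    return [regression_function(i / n) + rng.gauss(0.0, 1.0) for i in range(1, n + 1)]


rng = random.Random(0)
n = 2000
y = empirical_coefficients(simulate_regression(n, rng), num_coeffs=4)
# Each Y_j fluctuates around mu_j with standard deviation close to 1 / sqrt(n).
print([round(v, 2) for v in y])
```

With $n = 2000$ the noise level is about $1/\sqrt{n} \approx 0.022$, so the printed values should sit close to $(1, 0.5, 0, 0)$.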
However, the most important aspect of the Normal means model is that it allows a
clean setting for studying complex problems. In this paper, we consider the following
Normal means model. Let

$$Y_{ij} \sim \begin{cases} (1-\epsilon)\,\mathcal{N}(0,\sigma^2) + \epsilon\,\mathcal{N}(\mu_{ij},\sigma^2), & j \in [k],\ i \in S, \\ \mathcal{N}(0,\sigma^2), & j \in [k],\ i \in S^c, \end{cases} \tag{3}$$

where $(\mu_{ij})_{i,j}$ are unknown real numbers, $\sigma = \sigma_0/\sqrt{n}$ is the standard deviation of the noise
with $\sigma_0 > 0$ known, $(Y_{ij})_{i,j}$ are random observations, $\epsilon \in [0,1]$ is the parameter that
controls the sparsity of features across tasks and $S \subset [p]$ is the set of relevant features.
Let $s = |S|$ denote the number of relevant features. Denote the matrix $M \in \mathbb{R}^{p \times k}$ of means

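$$M = \begin{pmatrix} \mu_{11} & \mu_{12} & \cdots & \mu_{1k} \\ \mu_{21} & \mu_{22} & \cdots & \mu_{2k} \\ \vdots & \vdots & \ddots & \vdots \\ \mu_{p1} & \mu_{p2} & \cdots & \mu_{pk} \end{pmatrix},$$

whose columns correspond to the tasks $1, \ldots, k$ and whose rows are indexed by $1, \ldots, p$,
and let $\theta_i = (\mu_{ij})_{j \in [k]}$ denote the $i$-th row of the matrix $M$. The set $S^c = [p] \setminus S$ indexes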



the zero rows of the matrix $M$ and the associated observations are distributed according
to the normal distribution with zero mean and variance $\sigma^2$. The rows indexed by $S$ are
non-zero and the corresponding observations come from a mixture of two normal
distributions. The parameter $\epsilon$ determines the proportion of observations coming from a
normal distribution with non-zero mean. The reader should regard each column as one
vector of parameters that we want to estimate. The question is whether sharing across
columns improves the estimation performance.

It is known from the work on the Lasso that in regression problems, the design matrix
needs to satisfy certain conditions in order for the Lasso to correctly identify the support
$S$ (see van de Geer and Bühlmann, 2009, for an extensive discussion on the different
conditions). These regularity conditions are essentially unavoidable. However, the Normal means
model (3) allows us to analyze the estimation procedure in (5) and focus on the scaling of
the important parameters $(n, k, p, s, \epsilon, \mu_{\min})$ for the success of the support recovery. Using
the model (3) and the estimation procedure in (5), we are able to identify regimes in which
estimating the support is more efficient using the ordinary Lasso than with the multi-task
Lasso and vice versa. Our results suggest that the multi-task Lasso does not outperform the
ordinary Lasso when the features are not considerably shared across tasks, and practitioners
should be careful when applying the multi-task Lasso without knowledge of the task
structure.



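An alternative representation of the model is

$$Y_{ij} \sim \begin{cases} \mathcal{N}(\xi_{ij}\mu_{ij},\sigma^2), & j \in [k],\ i \in S, \\ \mathcal{N}(0,\sigma^2), & j \in [k],\ i \in S^c, \end{cases} \tag{4}$$

where $\xi_{ij}$ is a Bernoulli random variable with success probability $\epsilon$. Throughout the paper,
we will set $\epsilon = k^{-\beta}$ for some parameter $\beta \in [0,1)$. The case $\beta < 1/2$ corresponds to dense rows
and $\beta > 1/2$ corresponds to sparse rows. Let $\mu_{\min}$ denote the absolute value of a smallest
non-zero element of $M$, $\mu_{\min} = \min |\mu_{ij}|$, with the minimum taken over the non-zero entries.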
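The Bernoulli representation lends itself to direct simulation. The sketch below is illustrative only and assumes the reading of the model in which, for a relevant row, each entry independently carries the signal $\mu_{ij}$ with probability $\epsilon = k^{-\beta}$; the function name, the zero-based row indices and the example dimensions are hypothetical.

```python
import random


def sample_observations(mu, support, eps, sigma, rng):
    """Draw a p x k matrix Y under the Bernoulli representation of the model.

    mu      : p x k list of lists of means (only rows in `support` are used)
    support : set of zero-based indices of the relevant rows S
    eps     : probability that an entry of a relevant row carries signal
    sigma   : noise standard deviation
    """
    y = []
    for i, row in enumerate(mu):
        if i in support:
            # xi_ij ~ Bernoulli(eps) switches the mean mu_ij on or off.
            y.append([rng.gauss(m_ij if rng.random() < eps else 0.0, sigma) for m_ij in row])
        else:
            y.append([rng.gauss(0.0, sigma) for _ in row])
    return y


# Hypothetical example: p = 6 rows, k = 50 tasks, two relevant rows, beta = 0.5.
p, k, beta = 6, 50, 0.5
support = {0, 1}
mu = [[3.0] * k if i in support else [0.0] * k for i in range(p)]
y = sample_observations(mu, support, eps=k ** (-beta), sigma=1.0, rng=random.Random(1))
print(len(y), len(y[0]))
```

Setting `beta` closer to 1 makes the relevant rows sparser, which is the regime where sharing across columns is expected to help least.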
Under the model (3), we analyze the penalized least squares procedures of the form

$$\hat{\mu} = \operatorname*{argmin}_{\mu \in \mathbb{R}^{p \times k}} \frac{1}{2}\|Y - \mu\|_F^2 + \mathrm{pen}(\mu) \tag{5}$$

where $\|A\|_F = \sqrt{\sum_{j,k} a_{jk}^2}$ is the Frobenius norm, $\mathrm{pen}(\cdot)$ is a penalty function and $\mu$ is a
$p \times k$ matrix of means. We consider the following penalties

1. the $\ell_1$ penalty

$$\mathrm{pen}(\mu) = \lambda \sum_{i \in [p]} \sum_{j \in [k]} |\mu_{ij}|,$$

which corresponds to the Lasso procedure applied on each task independently, and
denote the resulting estimate as $\hat{\mu}^{\ell_1}$;

2. the mixed $(2,1)$-norm penalty

$$\mathrm{pen}(\mu) = \lambda \sum_{i \in [p]} \|\theta_i\|_2,$$

which corresponds to the multi-task Lasso formulation in Obozinski et al. (2010) and
Lounici et al. (2009), and denote the resulting estimate as $\hat{\mu}^{\ell_1/\ell_2}$;

3. the mixed $(\infty,1)$-norm penalty

$$\mathrm{pen}(\mu) = \lambda \sum_{i \in [p]} \|\theta_i\|_\infty,$$

which corresponds to the multi-task Lasso formulation in Negahban and Wainwright
(2009), and denote the resulting estimate as $\hat{\mu}^{\ell_1/\ell_\infty}$.

For any solution $\hat{\mu}$ of (5), let $S(\hat{\mu})$ denote the set of estimated non-zero rows

$$S(\hat{\mu}) = \{i \in [p] : \hat{\theta}_i \neq 0\}, \tag{6}$$

where $\hat{\theta}_i$ is the $i$-th row of $\hat{\mu}$. We establish sufficient conditions under which
$P[S(\hat{\mu}) \neq S] \leq \alpha$ for different methods.
These results are complemented with necessary conditions for the recovery of the support
set $S$.
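To make the three penalties and the estimated support concrete, here is a small sketch that evaluates them on a toy matrix. It is an illustration under the definitions above, with hypothetical function names, and it treats the estimated support as the set of rows that are not identically zero.

```python
import math


def l1_penalty(mu, lam):
    """lam * sum_i sum_j |mu_ij| (Lasso applied to each task separately)."""
    return lam * sum(abs(v) for row in mu for v in row)


def l1_l2_penalty(mu, lam):
    """lam * sum_i ||theta_i||_2 (mixed (2,1)-norm)."""
    return lam * sum(math.sqrt(sum(v * v for v in row)) for row in mu)


def l1_linf_penalty(mu, lam):
    """lam * sum_i ||theta_i||_inf (mixed (infinity,1)-norm)."""
    return lam * sum(max(abs(v) for v in row) for row in mu)


def objective(y, mu, penalty, lam):
    """Penalized least squares objective 0.5 * ||Y - mu||_F^2 + pen(mu)."""
    fit = sum((a - b) ** 2 for ry, rm in zip(y, mu) for a, b in zip(ry, rm))
    return 0.5 * fit + penalty(mu, lam)


def estimated_support(mu_hat):
    """Zero-based indices of the rows of mu_hat that are not identically zero."""
    return {i for i, row in enumerate(mu_hat) if any(v != 0.0 for v in row)}


mu = [[3.0, -4.0], [0.0, 0.0], [0.0, 1.0]]
print(l1_penalty(mu, 2.0), l1_l2_penalty(mu, 2.0), l1_linf_penalty(mu, 2.0))
print(estimated_support(mu))
```

For a given row the three penalties are ordered as $\|\theta_i\|_\infty \le \|\theta_i\|_2 \le \|\theta_i\|_1$, which the toy example reflects.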

1.2 Overview of the main results 

The main contributions of the paper can be summarized as follows. 

1. We establish a lower bound on the parameter $\mu_{\min}$ as a function of the parameters
$(n, k, p, s, \beta)$. Our result can be interpreted as follows: for any estimation procedure
there exists a model given by (3) with non-zero elements equal to $\mu_{\min}$ such that the
estimation procedure will make an error when identifying the set $S$ with probability
bounded away from zero.

2. We establish sufficient conditions on the signal strength $\mu_{\min}$ for the Lasso and
both variants of the group Lasso under which these procedures can correctly identify
the set of non-zero rows $S$.

By comparing the lower bounds with the sufficient conditions, we are able to identify
regimes in which each procedure is optimal for the problem of identifying the set of
non-zero rows $S$. Furthermore, we point out that the usage of the popular group Lasso with the
mixed $(\infty,1)$ norm can be disastrous when features are not perfectly shared among tasks.
This is further demonstrated through an empirical study.

1.3 Organization of the paper 

The paper is organized as follows. We start by analyzing the lower bound for any procedure
for the problem of identifying the set of non-zero rows in §2. In §3 we provide sufficient
conditions on the signal strength $\mu_{\min}$ for the Lasso and the group Lasso to be able to detect
the set of non-zero rows $S$. In the following section, we propose an improved approach to
the problem of estimating the set $S$. Results of a small empirical study are reported in §5.
We close the paper with a discussion of our findings.

2. Lower bound on the support recovery 

In this section, we derive a lower bound for the problem of identifying the correct variables.
In particular, we derive conditions on $(n, k, p, s, \epsilon, \mu_{\min})$ under which any method is going
to make an error when estimating the correct variables. Intuitively, if $\mu_{\min}$ is very small, a
non-zero row may be hard to distinguish from a zero row. Similarly, if $\epsilon$ is very small, many
elements in a row will be zero and, again, as a result it may be difficult to identify a non-zero
row. Before we give the main result of the section, we introduce the class of models that
are going to be considered.
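Let

$$\mathcal{F}[\mu] := \{\theta \in \mathbb{R}^k : \min_j |\theta_j| \geq \mu\}$$

denote the set of feasible non-zero rows. For each $j \in \{0, 1, \ldots, k\}$, let $\mathcal{M}(j,k)$ be the class
of all the subsets of $\{1, \ldots, k\}$ of cardinality $j$. Let

$$\mathcal{M}[\mu,s] = \bigcup_{\omega \in \mathcal{M}(s,p)} \left\{(\theta_1, \ldots, \theta_p)' \in \mathbb{R}^{p \times k} : \theta_i \in \mathcal{F}[\mu] \text{ if } i \in \omega,\ \theta_i = 0 \text{ if } i \notin \omega \right\} \tag{7}$$

be the class of all feasible matrix means. For a matrix $M \in \mathcal{M}[\mu,s]$, let $P_M$ denote the joint
law of $\{Y_{ij}\}_{i \in [p], j \in [k]}$. Since $P_M$ is a product measure, we can write $P_M = \otimes_{i \in [p]} P_{\theta_i}$. For a
non-zero row $\theta_i$, we set

$$P_{\theta_i}(A) = \int \mathcal{N}(A; \theta, \sigma^2 I_k)\,d\nu(\theta), \quad A \in \mathcal{B}(\mathbb{R}^k),$$

where $\nu$ is the distribution of the random variable $\sum_{j \in [k]} \mu_{ij}\xi_j e_j$ with $\xi_j \sim \mathrm{Bernoulli}(k^{-\beta})$
and $\{e_j\}_{j \in [k]}$ denoting the canonical basis of $\mathbb{R}^k$. For a zero row $\theta_i = 0$, we set

$$P_0(A) = \mathcal{N}(A; 0, \sigma^2 I_k), \quad A \in \mathcal{B}(\mathbb{R}^k).$$

With this notation, we have the following results.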
With this notation, we have the following results. 
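Theorem 1 Let

$$\mu_{\min}^2 = \mu_{\min}^2(n,k,p,s,\epsilon,\beta) = \ln\left(1 + u + \sqrt{2u + u^2}\right)\sigma^2 \tag{8}$$

where

$$u = \frac{\ln\left(1 + \frac{\alpha^2(p-s+1)}{2}\right)}{2k^{1-2\beta}}.$$

If $\alpha \in (0, \frac{1}{2})$ and $k^{-\beta}u < 1$, then for all $\mu \leq \mu_{\min}$,

$$\inf_{\hat{\mu}} \sup_{M \in \mathcal{M}[\mu,s]} P_M[S(\hat{\mu}) \neq S(M)] \geq \frac{1}{2}(1-\alpha) \tag{9}$$

where $\mathcal{M}[\mu,s]$ is given by (7).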

The result can be interpreted in words in the following way: whatever the estimation
procedure $\hat{\mu}$, there exists some matrix $M \in \mathcal{M}[\mu_{\min}, s]$ such that the probability of incorrectly
identifying the support $S(M)$ is bounded away from zero. In the next section, we will see
that some estimation procedures achieve the lower bound given in Theorem 1.

3. Upper bounds on the support recovery 

In this section, we present sufficient conditions on $(n, p, k, \epsilon, \mu_{\min})$ for different estimation
procedures, so that

$$P[S(\hat{\mu}) \neq S] \leq \alpha.$$

Let $\alpha', \delta' > 0$ be two parameters such that $\alpha' + \delta' = \alpha$. The parameter $\alpha'$ controls the
probability of making a type one error

$$P[\exists i \in [p] : i \in S(\hat{\mu}) \text{ and } i \notin S] \leq \alpha',$$

that is, the parameter $\alpha'$ upper bounds the probability that there is a zero row of the matrix
$M$ that is estimated as a non-zero row. Likewise, the parameter $\delta'$ controls the probability
of making a type two error

$$P[\exists i \in [p] : i \notin S(\hat{\mu}) \text{ and } i \in S] \leq \delta',$$

that is, the parameter $\delta'$ upper bounds the probability that there is a non-zero row of the
matrix $M$ that is estimated as a zero row.

The control of the type one and type two errors is established through the tuning
parameter $\lambda$. It can be seen that if the parameter $\lambda$ is chosen such that, for all $i \in S$, it
holds that $P[i \notin S(\hat{\mu})] \leq \delta'/s$ and, for all $i \in S^c$, it holds that $P[i \in S(\hat{\mu})] \leq \alpha'/(p-s)$, then
using the union bound we have that $P[S(\hat{\mu}) \neq S] \leq \alpha$. In the following subsections, we will
use the outlined strategy to choose $\lambda$ for different estimation procedures.
3.1 Upper bounds for the Lasso

Recall that the Lasso estimator is given as

$$\hat{\mu}^{\ell_1} = \operatorname*{argmin}_{\mu \in \mathbb{R}^{p \times k}} \frac{1}{2}\|Y - \mu\|_F^2 + \lambda\|\mu\|_1. \tag{10}$$

It is easy to see that the solution of the above estimation problem is given as the following
soft-thresholding operation

$$\hat{\mu}_{ij}^{\ell_1} = \left(1 - \frac{\lambda}{|Y_{ij}|}\right)_+ Y_{ij}, \tag{11}$$

where $(x)_+ := \max(0, x)$. From (11), it is obvious that $i \in S(\hat{\mu}^{\ell_1})$ if and only if the
maximum statistic, defined as

$$M_k(i) = \max_j |Y_{ij}|,$$

satisfies $M_k(i) > \lambda$. Therefore it is crucial to find the critical value of the parameter $\lambda$ such
that

$$P[M_k(i) < \lambda] \leq \delta'/s, \quad i \in S,$$

$$P[M_k(i) > \lambda] \leq \alpha'/(p-s), \quad i \in S^c.$$

We start by controlling the type one error. For $i \in S^c$ it holds that

$$P[M_k(i) > \lambda] \leq k\,P[|\mathcal{N}(0,\sigma^2)| > \lambda] \leq \frac{2k\sigma}{\sqrt{2\pi}\lambda}\exp\left(-\frac{\lambda^2}{2\sigma^2}\right) \tag{12}$$

using Lemma 7. Setting the right hand side to $\alpha'/(p-s)$ in the above display, we obtain
that $\lambda$ can be set as

$$\lambda = \sigma\sqrt{2\ln\frac{2k(p-s)}{\sqrt{2\pi}\alpha'}} \tag{13}$$

and (12) holds as soon as $2\ln\frac{2k(p-s)}{\sqrt{2\pi}\alpha'} \geq 1$. Next, we deal with the type two error. Let

$$\pi_k = P[|(1-\epsilon)\mathcal{N}(0,\sigma^2) + \epsilon\mathcal{N}(\mu_{\min},\sigma^2)| > \lambda]. \tag{14}$$

Then for $i \in S$, $P[M_k(i) < \lambda] \leq P[\mathrm{Bin}(k,\pi_k) = 0]$, where $\mathrm{Bin}(k,\pi_k)$ denotes the binomial
random variable with parameters $(k,\pi_k)$. Control of the type two error is going to be
established through careful analysis of $\pi_k$ for various regimes of problem parameters.
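The Lasso rule in the Normal means setting amounts to entrywise soft-thresholding followed by a check of the row maximum. The sketch below is illustrative and assumes the threshold reads $\lambda = \sigma\sqrt{2\ln(2k(p-s)/(\sqrt{2\pi}\alpha'))}$; the function names and the example values are hypothetical.

```python
import math


def lasso_lambda(k, p, s, sigma, alpha_prime):
    """Threshold sigma * sqrt(2 ln(2 k (p - s) / (sqrt(2 pi) alpha')))."""
    arg = 2.0 * k * (p - s) / (math.sqrt(2.0 * math.pi) * alpha_prime)
    return sigma * math.sqrt(2.0 * math.log(arg))


def soft_threshold(y, lam):
    """Entrywise (1 - lam / |Y_ij|)_+ Y_ij; zero entries stay zero."""
    return [
        [max(0.0, 1.0 - lam / abs(v)) * v if v != 0.0 else 0.0 for v in row]
        for row in y
    ]


def lasso_support(y, lam):
    """Rows whose maximum statistic M_k(i) = max_j |Y_ij| exceeds lam."""
    return {i for i, row in enumerate(y) if max(abs(v) for v in row) > lam}


y = [[3.0, -1.0, 0.5], [0.2, 0.0, -0.3]]
print(soft_threshold(y, 1.0))
print(lasso_support(y, 1.0))
print(round(lasso_lambda(k=896, p=128, s=7, sigma=1.0, alpha_prime=0.01), 2))
```

A row survives thresholding as soon as a single entry is large, which is why the Lasso is well suited to sparse non-zero rows.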

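Theorem 2 Let $\lambda$ be defined by (13). Suppose $\mu_{\min}$ satisfies one of the following two cases:
(i) $\mu_{\min} = \sigma\sqrt{2r\ln k}$ where

$$r > \left(\sqrt{1 + C_{k,p,s}} - \sqrt{1-\beta}\right)^2$$

with

$$C_{k,p,s} = \frac{\ln\frac{2(p-s)}{\sqrt{2\pi}\alpha'}}{\ln k}$$

and $\lim_{n \to \infty} C_{k,p,s} \in [0,\infty)$;
(ii) $\mu_{\min} \geq \lambda$ when

$$\lim_{n \to \infty} \frac{\ln k}{\ln(p-s)} = 0$$

and $k^{1-\beta}/2 \geq \ln(s/\delta')$.

Then

$$P[S(\hat{\mu}^{\ell_1}) \neq S] \leq \alpha.$$

The proof is given in §7.2.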

Now we can compare the lower bound on $\mu_{\min}^2$ from Theorem 1 and the upper bound
from Theorem 2. Without loss of generality we assume that $\sigma = 1$. We have that when
$\beta < 1/2$ the lower bound is of the order $O(\ln(k^{\beta-1/2}\ln(p-s)))$ and the upper bound is
of the order $\ln(k(p-s))$. Ignoring the logarithmic terms in $p$ and $s$, we have that the lower
bound is of the order $O(k^{\beta-1/2})$ and the upper bound is of the order $O(\ln k)$, which implies
that the Lasso does not achieve the lower bound when the non-zero rows are dense. When
the non-zero rows are sparse, $\beta > 1/2$, we have that both the lower and upper bound are
of the order $O(\ln k)$ (ignoring the terms depending on $p$ and $s$).

3.2 Upper bounds for the group Lasso 

Recall that the group Lasso estimator is given as

$$\hat{\mu}^{\ell_1/\ell_2} = \operatorname*{argmin}_{\mu \in \mathbb{R}^{p \times k}} \frac{1}{2}\|Y - \mu\|_F^2 + \lambda\sum_{i \in [p]}\|\theta_i\|_2, \tag{15}$$

where $\theta_i = (\mu_{ij})_{j \in [k]}$. The group Lasso estimator can be obtained in a closed form as a
result of the following thresholding operation (see, for example, Friedman et al., 2010)



where $Y_{i\cdot}$ is the $i$-th row of the data. From (16), it is obvious that $i \in S(\hat{\mu}^{\ell_1/\ell_2})$ if and only
if the statistic defined as

satisfies $S_k(i) > \lambda$. The choice of $\lambda$ is crucial for the control of type one and type two errors.

We use the following result, which directly follows from Theorem 2 in Baraud (2002).



Lemma 3 Let $\{Y_i = f_i + \sigma\xi_i\}_{i \in [n]}$ be a sequence of independent observations, where
$f = \{f_i\}_{i \in [n]}$ is a sequence of numbers, $\xi_i \sim \mathcal{N}(0,1)$ and $\sigma$ is a known positive constant.
Suppose that $t_{n,\alpha} \in \mathbb{R}$ satisfies $P[\chi_n^2 > t_{n,\alpha}] \leq \alpha$. Let

$$\phi_\alpha = I\left\{\sum_{i \in [n]} Y_i^2 > t_{n,\alpha}\sigma^2\right\}$$

be a test for $f = 0$ versus $f \neq 0$. Then the test $\phi_\alpha$ satisfies

$$P[\phi_\alpha = 1] \leq \alpha$$

when $f = 0$ and

$$P[\phi_\alpha = 0] \leq \delta$$

for all $f$ such that

$$\|f\|_2^2 \geq 2(\sqrt{5} + 4)\,\sigma^2\ln\left(\frac{2e}{\alpha\delta}\right)\sqrt{n}.$$

Proof Follows immediately from Theorem 2 in Baraud (2002). ∎

It follows directly from Lemma 3 that setting

$$\lambda = t_{n,\alpha'/(p-s)}\,\sigma^2 \tag{17}$$

will control the probability of type one error at the desired level, that is,

$$P[S_k(i) > \lambda] \leq \alpha'/(p-s), \quad \forall i \in S^c.$$

The following theorem gives us the control of the type two error.
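Theorem 4 Let $\lambda = t_{n,\alpha'/(p-s)}\,\sigma^2$. Then

$$P[S(\hat{\mu}^{\ell_1/\ell_2}) \neq S] \leq \alpha$$

if

$$\mu_{\min} \geq \sigma\sqrt{2(\sqrt{5}+4)}\,\sqrt{\frac{k^{-1/2+\beta}}{1-c}}\,\sqrt{\ln\frac{2e(2s-\delta')(p-s)}{\alpha'\delta'}}$$

where $c = \sqrt{2\ln(2s/\delta')/k^{1-\beta}}$.

The proof is given in §7.3.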

Using Theorem 1 and Theorem 4 we can compare the lower bound on $\mu_{\min}^2$ and the
upper bound. Without loss of generality we assume that $\sigma = 1$. When each non-zero row
is dense, that is, when $\beta < 1/2$, we have that both lower and upper bounds are of the order
$O(k^{\beta-1/2})$ (ignoring the logarithmic terms in $p$ and $s$). This suggests that the group Lasso
performs better than the Lasso for the case where there is a lot of feature sharing between
different tasks. Recall from the previous section that the Lasso in this setting does not have
the optimal dependence on $k$. However, when $\beta > 1/2$, that is, in the sparse non-zero row
regime, we see that the lower bound is of the order $O(\ln(k))$ whereas the upper bound is of
the order $O(k^{\beta-1/2})$. This implies that the group Lasso does not have optimal dependence
on $k$ in the sparse non-zero row setting.

3.3 Upper bounds for the group Lasso with the mixed (oo, 1) norm 

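In this section, we analyze the group Lasso estimator with the mixed $(\infty,1)$ norm, defined
as

$$\hat{\mu}^{\ell_1/\ell_\infty} = \operatorname*{argmin}_{\mu \in \mathbb{R}^{p \times k}} \frac{1}{2}\|Y - \mu\|_F^2 + \lambda\sum_{i \in [p]}\|\theta_i\|_\infty, \tag{18}$$

where $\theta_i = (\mu_{ij})_{j \in [k]}$. The closed form solution for $\hat{\mu}^{\ell_1/\ell_\infty}$ can be obtained (see Liu et al.,
2009); however, we are only going to use the following lemma.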
Lemma 5 (Liu et al., 2009) $\hat{\theta}_i^{\ell_1/\ell_\infty} = 0$ if and only if $\sum_j |Y_{ij}| \leq \lambda$.

Proof See the proof of Proposition 5 in Liu et al. (2009). ∎

Suppose that the penalty parameter $\lambda$ is set as

$$\lambda = k\sigma\sqrt{2\ln\frac{k(p-s)}{\alpha'}}. \tag{19}$$

Then it follows directly from Lemma 7 that

$$P\Big[\sum_j |Y_{ij}| \geq \lambda\Big] \leq k\max_j P[|Y_{ij}| \geq \lambda/k] \leq \alpha'/(p-s), \quad \forall i \in S^c,$$

which implies that the probability of the type one error is controlled at the desired level.
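The selection rule implied by Lemma 5 depends on each row only through its $\ell_1$ norm. The following sketch is an illustration with hypothetical names; it assumes the penalty level reads $\lambda = k\sigma\sqrt{2\ln(k(p-s)/\alpha')}$ and keeps a row exactly when $\sum_j |Y_{ij}| > \lambda$.

```python
import math


def linf_lambda(k, p, s, sigma, alpha_prime):
    """Penalty level k * sigma * sqrt(2 ln(k (p - s) / alpha'))."""
    return k * sigma * math.sqrt(2.0 * math.log(k * (p - s) / alpha_prime))


def linf_group_support(y, lam):
    """Rows kept by the (infinity,1) group Lasso: those with sum_j |Y_ij| > lam."""
    return {i for i, row in enumerate(y) if sum(abs(v) for v in row) > lam}


y = [
    [5.0, 0.0, 0.0, 0.0],   # one strong entry, row sum 5
    [1.0, 1.0, 1.0, 1.0],   # weak but fully shared signal, row sum 4
    [0.1, -0.1, 0.1, 0.1],  # noise-like row, row sum 0.4
]
print(linf_group_support(y, 3.0))
print(linf_lambda(k=4, p=3, s=2, sigma=1.0, alpha_prime=0.01))
```

Because the penalty level grows linearly in $k$, a row with only a few active tasks needs very large entries to pass the threshold, which is consistent with the poor dependence on $k$ discussed for this penalty.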



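Theorem 6 Let the penalty parameter $\lambda$ be defined by (19). Then

$$P[S(\hat{\mu}^{\ell_1/\ell_\infty}) \neq S] \leq \alpha$$

if

$$\mu_{\min} \geq \frac{1+r}{1-c}\,k^{-1+\beta}\lambda$$

where $c = \sqrt{2\ln(2s/\delta')/k^{1-\beta}}$ and $r = \sigma\sqrt{2k\ln\frac{2s-\delta'}{\delta'}}/\lambda$.

The proof is given in §7.4.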

Comparing upper bounds for the Lasso and the group Lasso with the mixed $(2,1)$ norm
with the result of Theorem 6, we can see that both the Lasso and the group Lasso have
better dependence on $k$ than the group Lasso with the mixed $(\infty,1)$ norm. The difference
becomes more pronounced as $\beta$ increases. This suggests that we should be very cautious
when using the group Lasso with the mixed $(\infty,1)$ norm, since as soon as the tasks do not
share exactly the same features, the other two procedures have much better performance
on identifying the set of non-zero rows.



4. Improved estimation procedure 

We have observed in the last section that the Lasso procedure performs better than the
group Lasso when each non-zero row is sparse, while the group Lasso (with the mixed $(2,1)$
norm) performs better when each non-zero row is dense. Since in many practical situations
one does not know how much overlap there is between different tasks, it would be useful to
combine the Lasso and the group Lasso in order to improve the performance. This can be
simply done by estimating $S(\hat{\mu}^{\ell_1})$ using (10) and $S(\hat{\mu}^{\ell_1/\ell_2})$ using (15) separately. Finally,
we can combine these estimates by taking their union $\hat{S} = S(\hat{\mu}^{\ell_1}) \cup S(\hat{\mu}^{\ell_1/\ell_2})$. The outlined
approach has the advantage that one does not need to know in advance which estimation
procedure to use. From the theoretical analysis of the Lasso and the group Lasso, we can
see that controlling the error of omitting a non-zero row is more difficult than controlling the
probability of falsely including a zero row. Therefore, combining the Lasso and the group
Lasso estimate can be seen as a way to increase the power to detect the non-zero rows.
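The union rule can be sketched in a few lines. This is an illustration rather than the procedure used in the experiments: it assumes the Lasso keeps a row when its largest absolute entry exceeds a threshold, and that the group Lasso keeps a row when its Euclidean norm exceeds a threshold (the standard row-wise soft-thresholding solution of a $(2,1)$-penalized least squares problem in this setting). The names `lam_lasso` and `lam_group` are hypothetical tuning parameters.

```python
import math


def lasso_support(y, lam):
    """Rows with max_j |Y_ij| > lam."""
    return {i for i, row in enumerate(y) if max(abs(v) for v in row) > lam}


def group_lasso_support(y, lam):
    """Rows with Euclidean norm ||Y_i.||_2 > lam (assumed row-wise thresholding rule)."""
    return {i for i, row in enumerate(y) if math.sqrt(sum(v * v for v in row)) > lam}


def union_support(y, lam_lasso, lam_group):
    """Union of the Lasso and group Lasso support estimates."""
    return lasso_support(y, lam_lasso) | group_lasso_support(y, lam_group)


y = [
    [4.0, 0.0, 0.0, 0.0],  # sparse row: one strong entry, norm 4
    [2.5, 2.5, 2.5, 2.5],  # dense row: moderate entries, norm 5
    [0.5, -0.5, 0.5, 0.5], # noise-like row, norm 1
]
print(lasso_support(y, 3.0), group_lasso_support(y, 4.5), union_support(y, 3.0, 4.5))
```

In the toy example each procedure detects a row that the other misses, so the union recovers both, at the price of adding the two type one error probabilities.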
5. Simulation results

We conduct a small-scale empirical study of the performance of the Lasso and the group
Lasso (both with the mixed $(2,1)$ norm and with the mixed $(\infty,1)$ norm). Our empirical
study shows that the theoretical findings sharply describe the behavior of the procedures
even for small sample studies. In particular, we demonstrate that as the minimum signal
level $\mu_{\min}$ varies in the model (3), our theory sharply determines the points at which the probability
of successfully identifying the non-zero rows of the matrix $M$ transitions from 0 to 1 for different
procedures.

The simulation procedure can be described as follows. Without loss of generality we
let $S = [s]$ and draw the samples $\{Y_{ij}\}_{i \in [p], j \in [k]}$ according to the model in (3). The total
number of rows $p$ is varied in $\{128, 256, 512, 1024\}$ and the number of columns is set to
$k = \lfloor p\log_2(p) \rfloor$. The sparsity of each non-zero row is controlled by changing the parameter
$\beta$ in $\{0, 0.25, 0.5, 0.75\}$ and setting $\epsilon = k^{-\beta}$. The number of non-zero rows is set to
$s = \lfloor \log_2(p) \rfloor$, the sample size is set to $n = 0.1p$ and $\sigma_0 = 1$. The parameters $\alpha'$ and $\delta'$ are both
set to 0.01. For each setting of the parameters, we report our results averaged over 1000
simulation runs. Simulations with other choices of parameters $n$, $s$ and $k$ have been tried
out, but the results were qualitatively similar and, hence, we do not report them here.
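The parameter grid of the study can be tabulated with a short script. This is a sketch with hypothetical names; it assumes the readings $k = \lfloor p\log_2 p \rfloor$, $s = \lfloor \log_2 p \rfloor$, $n = 0.1p$, $\epsilon = k^{-\beta}$ and $\sigma = \sigma_0/\sqrt{n}$.

```python
import math


def simulation_settings(p, beta, sigma0=1.0):
    """Derived quantities for one simulation configuration."""
    k = math.floor(p * math.log2(p))  # number of tasks (columns)
    s = math.floor(math.log2(p))      # number of non-zero rows
    n = 0.1 * p                       # sample size
    return {
        "p": p,
        "k": k,
        "s": s,
        "n": n,
        "eps": k ** (-beta),          # per-entry probability of carrying signal
        "sigma": sigma0 / math.sqrt(n),
    }


for p in (128, 256, 512, 1024):
    for beta in (0.0, 0.25, 0.5, 0.75):
        cfg = simulation_settings(p, beta)
        print(cfg["p"], cfg["k"], cfg["s"], beta, round(cfg["eps"], 4), round(cfg["sigma"], 4))
```

Even for the smallest configuration the number of tasks is several times the number of rows, so the expected number of active entries in a relevant row, $k^{1-\beta}$, varies widely across the values of $\beta$.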



5.1 Lasso

We investigate the performance of the Lasso for the purpose of estimating the set of
non-zero rows, $S$. Figure 1 plots the probability of success as a function of the signal strength.
On the same figure we plot the probability of success for the group Lasso with both $(2,1)$
and $(\infty,1)$-mixed norms. Using Theorem 2, we set

$$\mu_{\mathrm{lasso}} = \sqrt{2(r + 0.001)\ln k} \tag{20}$$

where $r$ is defined in Theorem 2. Next, we generate data according to (3) with all elements
$\{\mu_{ij}\}$ set to $\mu = \rho\mu_{\mathrm{lasso}}$, where $\rho \in [0.05, 2]$. The penalty parameter $\lambda$ is chosen as in (13).
Figure 1 plots the probability of success as a function of the parameter $\rho$, which controls the
signal strength. This probability transitions very sharply from 0 to 1. A rectangle on a
horizontal line represents points at which the probability $P[\hat{S} = S]$ is between 0.05 and
0.95. From each subfigure in Figure 1, we can observe that the probability of success for the
Lasso transitions from 0 to 1 for the same value of the parameter $\rho$ for different values of $p$,
which indicates that, except for constants, our theory correctly characterizes the scaling of
$\mu_{\min}$. In addition, we can see that the Lasso outperforms the group Lasso (with $(2,1)$-mixed
norm) when each non-zero row is very sparse (the parameter $\beta$ is close to one).

5.2 Group Lasso

Next, we focus on the empirical performance of the group Lasso with the mixed $(2,1)$ norm.
Figure 2 plots the probability of success as a function of the signal strength. Using Theorem
4, we set

(21)
[Figure 1 panels: probability of successful support recovery for the Lasso, one panel per value of the sparsity parameter $\beta$.]

Figure 1: The probability of success for the Lasso for the problem of estimating $S$ plotted
against the signal strength, which is varied as a multiple of $\mu_{\mathrm{lasso}}$ defined in (20).
A rectangle on each horizontal line represents points at which the probability
$P[\hat{S} = S]$ is between 0.05 and 0.95. Different subplots represent the probability
of success as the sparsity parameter $\beta$ changes.



where $c$ is defined in Theorem 4. Next, we generate data according to (3) with all elements
$\{\mu_{ij}\}$ set to $\mu = \rho\mu_{\mathrm{group}}$, where $\rho \in [0.05, 2]$. The penalty parameter $\lambda$ is given by (17).
Figure 2 plots the probability of success as a function of the parameter $\rho$, which controls the
signal strength. A rectangle on a horizontal line represents points at which the probability
$P[\hat{S} = S]$ is between 0.05 and 0.95. From each subfigure in Figure 2, we can observe that
the probability of success for the group Lasso transitions from 0 to 1 for the same value of
[Figure 2 panels: probability of successful support recovery for the group Lasso, one panel per value of the sparsity parameter $\beta$ (0, 0.25, 0.5, 0.75). Each panel shows the Lasso, the group Lasso with the $(2,1)$ norm and the group Lasso with the $(\infty,1)$ norm for $p \in \{128, 256, 512, 1024\}$. Horizontal axis: signal strength ($\times\,\mu_{\mathrm{group}}$), from 0.0 to 2.0.]

Figure 2: The probability of success for the group Lasso for the problem of estimating
$S$ plotted against the signal strength, which is varied as a multiple of $\mu_{\mathrm{group}}$
defined in (21). A rectangle on each horizontal line represents points at which
the probability $P[\hat{S} = S]$ is between 0.05 and 0.95. Different subplots represent
the probability of success as the sparsity parameter $\beta$ changes.



the parameter $\rho$ for different values of $p$, which indicates that, except for constants, our
theory correctly characterizes the scaling of $\mu_{\min}$. We also observe that the group Lasso
outperforms the Lasso when each non-zero row is not too sparse, that is, when there is a
considerable overlap of features between different tasks.
5.3 Group Lasso with the mixed $(\infty,1)$ norm

Next, we focus on the empirical performance of the group Lasso with the mixed $(\infty,1)$
norm. Figure 3 plots the probability of success as a function of the signal strength. Using
Theorem 6, we set

$$\mu_{\mathrm{infty}} = \frac{1+r}{1-c}\,k^{-1+\beta}\lambda \tag{22}$$

where $r$ and $c$ are defined in Theorem 6 and $\lambda$ is given by (19). Next, we generate data
according to (3) with all elements $\{\mu_{ij}\}$ set to $\mu = \rho\mu_{\mathrm{infty}}$, where $\rho \in [0.05, 2]$. Figure 3 plots
the probability of success as a function of the parameter $\rho$, which controls the signal strength.
A rectangle on a horizontal line represents points at which the probability $P[\hat{S} = S]$ is
between 0.05 and 0.95. From each subfigure in Figure 3, we can observe that the probability
of success for the group Lasso transitions from 0 to 1 for the same value of the parameter
$\rho$ for different values of $p$, which indicates that, except for constants, our theory correctly
characterizes the scaling of $\mu_{\min}$. We also observe that the group Lasso with the mixed
$(\infty,1)$ norm never outperforms the Lasso or the group Lasso with the mixed $(2,1)$ norm.

6. Discussion 

We have studied the benefits of task sharing in sparse problems. Under many scenarios, the
group Lasso outperforms the Lasso. The $\ell_1/\ell_2$ penalty seems to be a much better choice for
the group Lasso than the $\ell_1/\ell_\infty$ penalty. However, as pointed out to us by Han Liu, for screening,
where false discoveries are less important than accurate recovery, it is possible that the
$\ell_1/\ell_\infty$ penalty could be useful.

We focused on the Normal means model. While this model is obviously a simplified 
model, it is extremely useful for theoretical study. The Normal means model is commonly 
used in Statistics and we hope that this paper encourages researchers in machine learning 
to consider wider use of this model as well. 

7. Proofs 

This section collects technical proofs of the results presented in the paper. Throughout the
section we use $c_1, c_2, \ldots$ to denote positive constants whose value may change from line to
line.

7.1 Proof of Theorem 1

Without loss of generality, we may assume $\sigma = 1$. Let $\phi(u)$ be the density of $\mathcal{N}(0,1)$ and
define $P_0$ and $P_1$ to be two probability measures on $\mathbb{R}^k$ with the densities with respect to
the Lebesgue measure given as

$$f_0(a_1, \ldots, a_k) = \prod_{j \in [k]} \phi(a_j) \tag{23}$$

and

$$f_1(a_1, \ldots, a_k) = E_Z E_m E_\xi \prod_{j \in m} \phi(a_j - \xi_j\mu_{\min}) \prod_{j \notin m} \phi(a_j) \tag{24}$$
[Figure 3 panels: probability of successful support recovery for the group Lasso with the mixed $(\infty,1)$ norm, one panel per value of the sparsity parameter $\beta$.]

Figure 3: The probability of success for the group Lasso with mixed $(\infty,1)$ norm for the
problem of estimating $S$ plotted against the signal strength, which is varied as a
multiple of $\mu_{\mathrm{infty}}$ defined in (22). A rectangle on each horizontal line represents
points at which the probability $P[\hat{S} = S]$ is between 0.05 and 0.95. Different
subplots represent the probability of success as the sparsity parameter $\beta$ changes.



where $Z \sim \mathrm{Bin}(k, k^{-\beta})$, $m$ is a random variable uniformly distributed over $\mathcal{M}(Z,k)$ and
$\{\xi_j\}_{j \in [k]}$ is a sequence of Rademacher random variables, independent of $Z$ and $m$. A
Rademacher random variable takes values $\pm 1$ with probability $\frac{1}{2}$.

To simplify the discussion, suppose that $p - s + 1$ is divisible by 2. Let $T = (p - s + 1)/2$.
Using $P_0$ and $P_1$, we construct the following three measures,
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3€{s,...,p} 
3 odd 



and 



It holds that 



jG{s,...,p} 
j even 



inf sup P M [S(M) / 5(£)] > inf max (Q (* = l),Qi(* = 0) 

V- MeM * ^ 

"lo-Qilli, 



(25) 



> 



2 T 



where the infimum is taken over all tests taking values in $\{0, 1\}$ and $\|\cdot\|_1$ is the total
variation distance between probability measures. For a readable introduction on lower
bounds on the minimax probability of error, see Section 2 in Tsybakov (2009). In particular,
our approach is related to the one described in Section 2.7.4. We proceed by upper bounding
the total variation distance between $Q_0$ and $Q_1$. Let $g = dP_1/dP_0$ and let $u_i \in \mathbb{R}^k$ for each
$i \in [p]$, then



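$$\frac{dQ_1}{d\tilde{Q}}(u_1, \ldots, u_p) = \frac{1}{T} \sum_{\substack{j \in \{s,\ldots,p\} \\ j \text{ even}}} \prod_{i \in \{1,\ldots,s-1\}} \frac{dP_1}{dP_1}(u_i) \prod_{i \in \{s,\ldots,j-1\}} \frac{dP_0}{dP_0}(u_i) \cdot \frac{dP_1}{dP_0}(u_j) \prod_{i \in \{j+1,\ldots,p\}} \frac{dP_0}{dP_0}(u_i) = \frac{1}{T} \sum_{\substack{j \in \{s,\ldots,p\} \\ j \text{ even}}} g(u_j)$$

and, similarly, we can compute $\frac{dQ_0}{d\tilde{Q}}(u_1, \ldots, u_p)$. The following holds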



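$$\|Q_0 - Q_1\|_1^2 = \Bigg( \int \bigg| \frac{1}{T} \sum_{\substack{j \in \{s,\ldots,p\} \\ j \text{ even}}} g(u_j) - \frac{1}{T} \sum_{\substack{j \in \{s,\ldots,p\} \\ j \text{ odd}}} g(u_j) \bigg| \prod_{i \in \{s,\ldots,p\}} dP_0(u_i) \Bigg)^2$$

$$\leq \frac{1}{T^2} \int \bigg( \sum_{\substack{j \in \{s,\ldots,p\} \\ j \text{ even}}} g(u_j) - \sum_{\substack{j \in \{s,\ldots,p\} \\ j \text{ odd}}} g(u_j) \bigg)^2 \prod_{i \in \{s,\ldots,p\}} dP_0(u_i)$$

$$= \frac{2}{T} \big( P_0(g^2) - 1 \big),$$

where the last equality follows by observing that

$$\int \sum_{\substack{j \in \{s,\ldots,p\} \\ j \text{ even}}} \sum_{\substack{j' \in \{s,\ldots,p\} \\ j' \text{ even}}} g(u_j) g(u_{j'}) \prod_{i \in \{s,\ldots,p\}} dP_0(u_i) = T P_0(g^2) + T^2 - T \tag{26}$$

and

$$\int \sum_{\substack{j \in \{s,\ldots,p\} \\ j \text{ even}}} \sum_{\substack{j' \in \{s,\ldots,p\} \\ j' \text{ odd}}} g(u_j) g(u_{j'}) \prod_{i \in \{s,\ldots,p\}} dP_0(u_i) = T^2.$$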



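Next, we proceed to upper bound $P_0(g^2)$, using some ideas presented in the proof of Theorem 1
in Baraud (2002). Recall definitions of $f_0$ and $f_1$ in (23) and (24) respectively. Then
$g = dP_1/dP_0 = f_1/f_0$ and we have

$$g(a_1, \ldots, a_k) = E_Z E_m E_\xi \bigg[ \exp\Big( -\frac{Z \mu_{\min}^2}{2} + \mu_{\min} \sum_{j \in m} \xi_j a_j \Big) \bigg] = E_Z \bigg[ \exp\Big( -\frac{Z \mu_{\min}^2}{2} \Big) E_m \Big[ \prod_{j \in m} \cosh(\mu_{\min} a_j) \Big] \bigg].$$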

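Furthermore, let $Z' \sim \mathrm{Bin}(k, k^{-\beta})$ be independent of $Z$ and $m'$ uniformly distributed over
$\mathcal{M}(Z', k)$. The following holds

$$P_0(g^2) = P_0\bigg( E_{Z',Z} \bigg[ \exp\Big( -\frac{(Z + Z') \mu_{\min}^2}{2} \Big) E_{m,m'} \Big[ \prod_{j \in m} \cosh(\mu_{\min} a_j) \prod_{j \in m'} \cosh(\mu_{\min} a_j) \Big] \bigg] \bigg)$$

$$= E_{Z',Z} \bigg[ \exp\Big( -\frac{(Z + Z') \mu_{\min}^2}{2} \Big) E_{m,m'} \Big[ \prod_{j \in m \cap m'} \int \cosh^2(\mu_{\min} a_j) \phi(a_j) \, da_j \prod_{j \in m \triangle m'} \int \cosh(\mu_{\min} a_j) \phi(a_j) \, da_j \Big] \bigg],$$

where we use $m \triangle m'$ to denote $(m \cup m') \setminus (m \cap m')$. By direct calculation, we have that

$$\int \cosh^2(\mu_{\min} a_j) \phi(a_j) \, da_j = \exp(\mu_{\min}^2) \cosh(\mu_{\min}^2)$$

and

$$\int \cosh(\mu_{\min} a_j) \phi(a_j) \, da_j = \exp(\mu_{\min}^2 / 2).$$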
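The two Gaussian integrals above can be sanity-checked numerically. The following is a minimal sketch, not part of the original proof: it approximates both expectations with a trapezoid rule and compares them with the closed forms. The helper `gaussian_expectation` and the value `mu = 0.7` are assumptions made only for illustration.

```python
import math

def gaussian_expectation(f, lo=-12.0, hi=12.0, n=20_000):
    """Approximate E[f(X)] for X ~ N(0, 1) with the composite trapezoid rule."""
    h = (hi - lo) / n
    total = 0.0
    for i in range(n + 1):
        x = lo + i * h
        weight = 0.5 if i in (0, n) else 1.0  # end points get half weight
        total += weight * f(x) * math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
    return total * h

mu = 0.7  # hypothetical value of mu_min, chosen only for illustration

# Integral of cosh^2(mu * a) against the standard Normal density.
lhs_sq = gaussian_expectation(lambda a: math.cosh(mu * a) ** 2)
rhs_sq = math.exp(mu ** 2) * math.cosh(mu ** 2)

# Integral of cosh(mu * a) against the standard Normal density.
lhs_lin = gaussian_expectation(lambda a: math.cosh(mu * a))
rhs_lin = math.exp(mu ** 2 / 2.0)

print(lhs_sq, rhs_sq)
print(lhs_lin, rhs_lin)
```

Both identities also follow from $\cosh^2(x) = (\cosh(2x) + 1)/2$ and the Normal moment generating function $E[e^{tX}] = e^{t^2/2}$.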

Since $\frac{1}{2}|m \triangle m'| + |m \cap m'| = (Z + Z')/2$, we have that

$$P_0(g^2) = E_{Z,Z'} \Big[ E_{m,m'} \Big[ \big( \cosh(\mu_{\min}^2) \big)^{|m \cap m'|} \Big] \Big] = E_{Z,Z'} \Big[ \sum_{j=0}^{k} p_j \big( \cosh(\mu_{\min}^2) \big)^j \Big] = E_{Z,Z'} \Big[ E_X \Big[ \cosh(\mu_{\min}^2)^X \Big] \Big],$$

where

$$p_j = \begin{cases} 0 & \text{if } j < Z + Z' - k \text{ or } j > \min(Z, Z'), \\ \binom{Z'}{j} \binom{k - Z'}{Z - j} \Big/ \binom{k}{Z} & \text{otherwise,} \end{cases}$$

and $P[X = j] = p_j$. Therefore, $X$ follows a hypergeometric distribution with parameters
$k$, $Z$, $Z'/k$. [The first parameter denotes the total number of stones in an urn, the second
parameter denotes the number of stones we are going to sample without replacement from
the urn and the last parameter denotes the fraction of white stones in the urn.] Then
following (Aldous, 1985, p. 173; see also Baraud (2002)), we know that $X$ has the same
distribution as the random variable $E[\tilde{X} \mid \mathcal{T}]$ where $\tilde{X}$ is a binomial random variable with
parameters $Z$ and $Z'/k$, and $\mathcal{T}$ is a suitable $\sigma$-algebra. By convexity, it follows that



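$$P_0(g^2) \leq E_{Z,Z'} \Big[ E_{\tilde{X}} \Big[ \cosh(\mu_{\min}^2)^{\tilde{X}} \Big] \Big] = E_{Z,Z'} \bigg[ \exp\Big( Z \ln\Big( 1 + \frac{Z'}{k} \big( \cosh(\mu_{\min}^2) - 1 \big) \Big) \Big) \bigg] = E_{Z'} E_Z \bigg[ \exp\Big( Z \ln\Big( 1 + \frac{Z'}{k} u \Big) \Big) \bigg],$$

where $\mu_{\min}^2 = \ln\big(1 + u + \sqrt{2u + u^2}\big)$ with

$$u = \frac{\ln\big(1 + \frac{\alpha^2 T}{2}\big)}{2 k^{1 - 2\beta}}.$$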
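The convexity step replaces a hypergeometric variable by a binomial one with the same mean. As an illustrative check, not taken from the original paper, the sketch below compares $E[c^X]$ under both distributions for $c = \cosh(\mu_{\min}^2)$ and confirms the closed form of the binomial expectation. The values of `k`, `Z`, `Zp` and `mu_sq` are hypothetical.

```python
import math

def hypergeom_pmf(j, k, Z, Zp):
    """P[X = j] when Z stones are drawn without replacement from k stones, Zp of them white."""
    if j < max(0, Z + Zp - k) or j > min(Z, Zp):
        return 0.0
    return math.comb(Zp, j) * math.comb(k - Zp, Z - j) / math.comb(k, Z)

def binom_pmf(j, n, q):
    """P[Bin(n, q) = j]."""
    return math.comb(n, j) * q ** j * (1.0 - q) ** (n - j)

# Hypothetical values, chosen only for illustration.
k, Z, Zp = 50, 7, 9
mu_sq = 0.8                 # plays the role of mu_min^2
c = math.cosh(mu_sq)        # c >= 1, so x -> c**x is convex

total_mass = sum(hypergeom_pmf(j, k, Z, Zp) for j in range(k + 1))
mgf_hyper = sum(hypergeom_pmf(j, k, Z, Zp) * c ** j for j in range(k + 1))
mgf_binom = sum(binom_pmf(j, Z, Zp / k) * c ** j for j in range(Z + 1))
closed_form = math.exp(Z * math.log(1.0 + (Zp / k) * (c - 1.0)))

print(mgf_hyper, mgf_binom, closed_form)
```

The hypergeometric expectation should never exceed the binomial one, which is the direction used in the bound on $P_0(g^2)$.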

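Continuing with our calculations, we have that

$$P_0(g^2) \leq E_{Z'} \exp\Big( k \ln\big( 1 + k^{-(1+\beta)} u Z' \big) \Big)$$

$$\leq E_{Z'} \exp\big( k^{-\beta} u Z' \big)$$

$$= \exp\Big( k \ln\big( 1 + k^{-\beta} \big( \exp(k^{-\beta} u) - 1 \big) \big) \Big)$$

$$\leq \exp\Big( k^{1-\beta} \big( \exp(k^{-\beta} u) - 1 \big) \Big)$$

$$\leq \exp\big( 2 k^{1-2\beta} u \big) = 1 + \frac{\alpha^2 T}{2}, \tag{27}$$

where the last inequality follows since $k^{-\beta} u < 1$ for all large $p$. Combining (27) with (26),
we have that

$$\|Q_0 - Q_1\|_1 \leq \alpha,$$

which implies that

$$\inf_{\hat{\mu}} \sup_{M \in \mathcal{M}} P_M[S(M) \neq S(\hat{\mu})] \geq \frac{1}{2} - \frac{1}{2} \alpha.$$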
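The chain of inequalities leading to (27) can be traced numerically. The sketch below is an illustration under assumed parameter values (`k`, `beta`, `alpha`, `T` are hypothetical); it evaluates each expression in the chain for $Z' \sim \mathrm{Bin}(k, k^{-\beta})$ and then the resulting bound on the total variation distance.

```python
import math

def binom_pmf(j, n, q):
    """P[Bin(n, q) = j]."""
    return math.comb(n, j) * q ** j * (1.0 - q) ** (n - j)

# Hypothetical parameter values, chosen only for illustration.
k, beta, alpha, T = 100, 0.25, 0.1, 50
q = k ** (-beta)                                   # success probability of Z'
u = math.log(1.0 + alpha ** 2 * T / 2.0) / (2.0 * k ** (1.0 - 2.0 * beta))
t = k ** (-beta) * u                               # must be below 1 for the last inequality

# Successive expressions in the chain, computed exactly or in closed form.
step1 = sum(binom_pmf(z, k, q) * math.exp(k * math.log1p(k ** (-(1.0 + beta)) * u * z))
            for z in range(k + 1))
step2 = sum(binom_pmf(z, k, q) * math.exp(t * z) for z in range(k + 1))
step3 = math.exp(k * math.log1p(q * math.expm1(t)))
step4 = math.exp(k ** (1.0 - beta) * math.expm1(t))
step5 = math.exp(2.0 * k ** (1.0 - 2.0 * beta) * u)

# Plugging the final bound on P_0(g^2) into (2/T)(P_0(g^2) - 1).
tv_bound = math.sqrt(2.0 / T * (step5 - 1.0))

print(step1, step2, step3, step4, step5, tv_bound)
```

With these values the final expression equals $1 + \alpha^2 T / 2$ and the implied bound on $\|Q_0 - Q_1\|_1$ equals $\alpha$ up to rounding, which is how $u$ was chosen.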



7.2 Proof of Theorem 2

Without loss of generality, we can assume that $\sigma = 1$ and rescale the final result. For $\lambda$
given in Cd, it holds that $P[|\mathcal{N}(0, 1)| > \lambda] = o(1)$. For the probability defined in (JUJ), we
have the following lower bound

$$\pi_k = (1 - \epsilon) P[|\mathcal{N}(0, 1)| \geq \lambda] + \epsilon P[|\mathcal{N}(\mu_{\min}, 1)| \geq \lambda] \geq \epsilon P[\mathcal{N}(\mu_{\min}, 1) \geq \lambda].$$

We prove the two cases separately.

Case 1: Large number of tasks. By direct calculation

$$\pi_k \geq \epsilon P[\mathcal{N}(\mu_{\min}, 1) \geq \lambda] = \frac{1}{\sqrt{4 \pi \log k}} \cdot \frac{k^{-\beta - \left(\sqrt{1 + C_{k,p,s}} - \sqrt{r}\right)^2}}{\sqrt{1 + C_{k,p,s}} - \sqrt{r}} =: \underline{\pi}_k.$$

Since $1 - \beta > \big(\sqrt{1 + C_{k,p,s}} - \sqrt{r}\big)^2$, using Lemma 8, $P[\mathrm{Bin}(k, \underline{\pi}_k) = 0] \to 0$ as $n \to \infty$. We
can conclude that as soon as $k \underline{\pi}_k \geq \ln(s/\delta')$, it holds that $P[\hat{S}(\hat{\mu}^{\ell_1}) \neq S] \leq \alpha$.

Case 2: Medium number of tasks. When $\mu_{\min} \geq \lambda$, it holds that

$$\pi_k \geq \epsilon P[\mathcal{N}(\mu_{\min}, 1) \geq \lambda] \geq \frac{k^{-\beta}}{2}.$$

Using Lemma 8 we can conclude that as soon as $k^{1-\beta}/2 \geq \ln(s/\delta')$, it holds that
$P[\hat{S}(\hat{\mu}^{\ell_1}) \neq S] \leq \alpha$.

7.3 Proof of Theorem 4



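Using Lemma 9, $P[\mathrm{Bin}(k, k^{-\beta}) \leq (1 - c) k^{1-\beta}] \leq \delta'/2s$ for $c = \sqrt{2 \ln(2s/\delta') / k^{1-\beta}}$. For $i \in S$,
we have that

$$P[S_k(i) \leq \lambda] \leq \frac{\delta'}{2s} + \Big( 1 - \frac{\delta'}{2s} \Big) P\Big[ S_k(i) \leq \lambda \,\Big|\, \big\{ \|\theta_i\|_2^2 \geq (1 - c) k^{1-\beta} \mu_{\min}^2 \big\} \Big].$$

Therefore, using Lemma 3 with $\delta = \delta'/(2s - \delta')$, it follows that $P[S_k(i) \leq \lambda] \leq \delta'/(2s)$ for all
$i \in S$ when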



r—z / k-i/2+p 2e(2s - 5>)(p - s) 

Since $\lambda = t_{n, \alpha'/(p-s)} \sigma^2$, $P[S_k(i) \geq \lambda] \leq \alpha'/(p - s)$ for all $i \in S^c$. We can conclude that
$P[\hat{S}(\hat{\mu}^{\ell_1/\ell_2}) \neq S] \leq \alpha$.

7.4 Proof of Theorem 6

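Without loss of generality, we can assume that $\sigma = 1$. Proceeding as in the proof of
Theorem 4, $P[\mathrm{Bin}(k, k^{-\beta}) \leq (1 - c) k^{1-\beta}] \leq \delta'/2s$ for $c = \sqrt{2 \ln(2s/\delta') / k^{1-\beta}}$ using Lemma 9.
Then for $i \in S$ it holds that

$$P\Big[ \sum_{j} |Y_{ij}| \leq \lambda \Big] \leq \frac{\delta'}{2s} + \Big( 1 - \frac{\delta'}{2s} \Big) P\big[ (1 - c) k^{1-\beta} \mu_{\min} + z_k \leq \lambda \big],$$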

where $z_k \sim \mathcal{N}(0, k)$. Since $(1 - c) k^{1-\beta} \mu_{\min} \geq (1 + \tau) \lambda$, the right-hand side of the above
display can be upper bounded as

The above display gives us the desired control of the type two error, and we can conclude
that $P[\hat{S}(\hat{\mu}^{\ell_1/\ell_\infty}) \neq S] \leq \alpha$.

Appendix A.

We provide in this section some known results that are used in the paper. 
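Lemma 7 Let $X \sim \mathcal{N}(0, 1)$, then $P[|X| > \lambda] \leq \frac{2}{\sqrt{2\pi} \lambda} \exp\big(-\frac{\lambda^2}{2}\big)$.

Proof Since $x/\lambda > 1$ for $x \in (\lambda, \infty)$, by direct calculation

$$P[X > \lambda] = \frac{1}{\sqrt{2\pi}} \int_{\lambda}^{\infty} \exp\Big( -\frac{x^2}{2} \Big) dx \leq \frac{1}{\sqrt{2\pi}} \int_{\lambda}^{\infty} \frac{x}{\lambda} \exp\Big( -\frac{x^2}{2} \Big) dx = \frac{1}{\sqrt{2\pi} \lambda} \exp\Big( -\frac{\lambda^2}{2} \Big)$$

and $P[|X| > \lambda] \leq 2 P[X > \lambda]$.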
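As a quick numerical illustration of Lemma 7 (a sketch, not part of the original appendix), the exact two-sided Normal tail can be computed from the complementary error function and compared with the bound $\frac{2}{\sqrt{2\pi}\lambda} \exp(-\lambda^2/2)$ implied by the proof. The function names and the grid of $\lambda$ values are assumptions for illustration.

```python
import math

def gaussian_two_sided_tail(lam):
    """Exact P[|X| > lam] for X ~ N(0, 1), via the complementary error function."""
    return math.erfc(lam / math.sqrt(2.0))

def lemma7_bound(lam):
    """Upper bound 2 / (sqrt(2 pi) lam) * exp(-lam^2 / 2) from the proof of Lemma 7."""
    return 2.0 / (math.sqrt(2.0 * math.pi) * lam) * math.exp(-lam ** 2 / 2.0)

lams = [0.5, 1.0, 2.0, 3.0, 5.0]  # hypothetical thresholds
rows = [(lam, gaussian_two_sided_tail(lam), lemma7_bound(lam)) for lam in lams]

for lam, exact, bound in rows:
    print(f"lambda={lam:.1f}  exact={exact:.3e}  bound={bound:.3e}")
```

The bound is loose for small $\lambda$ and becomes tight as $\lambda$ grows, which is the regime used in the proofs.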

Lemma 8 If $z_k \sim \mathrm{Bin}(k, \pi_k)$, then for all $k \geq 1$ and all $\pi_k \in (0, 1)$ it holds that

$$P[z_k = 0] \leq \exp(-k \pi_k).$$

Proof $P[z_k = 0] = (1 - \pi_k)^k = \exp\big(-k \log\frac{1}{1 - \pi_k}\big) = \exp\big(-k(\pi_k + O(\pi_k^2))\big) \leq \exp(-k \pi_k)$.
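Lemma 8 is easy to confirm on a small grid. The following sketch, with hypothetical values of $k$ and $\pi_k$, compares the exact probability $(1 - \pi_k)^k$ with the bound $\exp(-k \pi_k)$.

```python
import math

# Hypothetical grid of (k, pi_k) pairs, chosen only for illustration.
cases = [(k, pi) for k in (1, 10, 100) for pi in (0.001, 0.05, 0.5, 0.9)]

# For each pair: exact P[Bin(k, pi) = 0] and the bound exp(-k * pi).
results = [((1.0 - pi) ** k, math.exp(-k * pi)) for k, pi in cases]

for (k, pi), (exact, bound) in zip(cases, results):
    print(f"k={k:4d}  pi={pi:.3f}  exact={exact:.3e}  bound={bound:.3e}")
```

The inequality is the elementary fact $1 - x \leq e^{-x}$ raised to the power $k$.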



Lemma 9 (Chernoff, 1981) If $z_k \sim \mathrm{Bin}(k, \pi_k)$, then

$$P[z_k \leq k \pi_k - t] \leq \exp\big(-t^2 / (2 k \pi_k)\big)$$

and

$$P[z_k \geq k \pi_k + t] \leq \exp\big(-t^2 / (2 (k \pi_k + t/3))\big).$$

Proof See Chernoff (1981).
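The two tail bounds of Lemma 9 can be compared with exact binomial tails. This is a minimal sketch under assumed values (`k = 200`, `pi_k = 0.1`, and the deviations `t` are hypothetical), using only exact binomial probabilities from `math.comb`.

```python
import math

def binom_pmf(j, n, q):
    """P[Bin(n, q) = j]."""
    return math.comb(n, j) * q ** j * (1.0 - q) ** (n - j)

def lower_tail(n, q, x):
    """Exact P[Bin(n, q) <= x]."""
    return sum(binom_pmf(j, n, q) for j in range(0, math.floor(x) + 1))

def upper_tail(n, q, x):
    """Exact P[Bin(n, q) >= x]."""
    return sum(binom_pmf(j, n, q) for j in range(math.ceil(x), n + 1))

k, pi_k = 200, 0.1  # hypothetical values
mean = k * pi_k

checks = []
for t in (2.0, 5.0, 10.0, 15.0):
    lower_exact = lower_tail(k, pi_k, mean - t)
    lower_bound = math.exp(-t ** 2 / (2.0 * mean))
    upper_exact = upper_tail(k, pi_k, mean + t)
    upper_bound = math.exp(-t ** 2 / (2.0 * (mean + t / 3.0)))
    checks.append((t, lower_exact, lower_bound, upper_exact, upper_bound))

for row in checks:
    print("t=%.0f  lower: %.3e <= %.3e   upper: %.3e <= %.3e" % row)
```

In the proofs of Section 7, the lower-tail bound is applied with $\pi_k = k^{-\beta}$ and $t = c\,k^{1-\beta}$, so that the right-hand side equals $\delta'/(2s)$ for the stated choice of $c$.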
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